Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Implemented performance speedup for binary ReliefF + bug fixes #79

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

CaptainKanuk
Copy link

@CaptainKanuk CaptainKanuk commented Dec 16, 2021

Context

TL:DR - I was able to implement some significant performance improvements for ReliefF on binary + discrete data.
For a GAMETES generated binary class discrete data file with 3200 rows and 400 attributes, the statistics are:
-ReliefF w/ 10 neighbors Before: 59.05 seconds (mean) | 0.73 seconds (std) (n=4 runs)
-ReliefF w/ 10 neighbors After: 12.68 seconds (mean) | 0.65 seconds (std) (n=4 runs)

Process

I set out to see whether I could speed up the implementation of these algorithms. I had a few hypotheses:

  • The algorithms are currently overindexed on code-deduplication, with central functions in scoring_utils.py that try to serve all types of algorithms
  • We can improve parallelization of the algorithms
  • We can strategically use numba to speed up frequently called functions
  • We can optimize functions to rely more on numpy array math

My general plan of action was as follows:

  • Write a simple performance benchmarking tool for the library.
  • Pick one algorithm and data type case (continuous vs. discrete features, etc.)
  • Create one optimized scoring function for that (algorithm / data) case using numpy / numba

Changes in This PR

For benchmarking I wrote a simple tool that takes some of the common test cases and runs timeit repeatedly. I drop the results of the first run to avoid measuring compilation overhead, which can be quite high for numba when we're using small testing datasets. The tool will print results but also dump them to a csv file in the same directory. To run this tool, python performance_tests.py Additionally - you can run a parameter sweep to estimate how performance varies across different parameter values with python performance_tests.py sweep.

I also wrote a simple shell script to automate the generation of profiling graphs using Graphviz. This is available at ./run_performance_benchmark.sh and creates a performance csv for selected test cases, a python cprofile data file, and a png performance graph.

Using these tools I confirmed that the bulk of the runtime was in finding relevant neighbors for the relief algorithm and scoring. This is unsurprising as those functions are called roughly O(num_rows ^ 2) times and were not currently performance optimized.

Additional Notes & Results

In the interests of time I focused on a proof of concept for ReliefF with binary features and discrete data. The main scoring functions in scoring_utils.py were indeed very complex, with a lot of switch statements and non-performant operations. There is an elegance to the centralization, but I think dedicated scoring functions are not only clearer to a reader but also allow for dedicated optimization. My approach was to create a dedicated compute_binary_score function which uses optimized numpy functions. I also parallelized the selection of neighbors, and optimized the neighbor selection function using numba. Doing this caused me to have to refactor some code, but I aimed to try and keep the contracts the same so that the rest of the codebase would not be affected.

Lastly - all tests were failing when I checked out the repo. I determined this was due to Imputer being deprecated from sklearn. I updated it to the new SimpleImputer. This caused only 5 tests to fail, but looking at recent repo history I don't believe all tests had been passing. It seems to be a more general bug with how the algorithms treat data with a mixture of contiuous and discrete attributes. I did not attempt to fix this problem.

@CaptainKanuk CaptainKanuk force-pushed the performance-improvements branch from b1e8b5f to 5d323e7 Compare December 29, 2021 00:49
# nopython = True will prevent a silent fallback if the function cannot
# be numba compiled for whatever reason.
@jit(nopython=True)
def find_neighbors_binary(
Copy link
Author

@CaptainKanuk CaptainKanuk Dec 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the core of the performance improvements for neighbor identification. We modify the function to find neighbors - add numba compilation, and add batch parallelization.

@@ -113,6 +113,42 @@ def find_neighbors_binary(
return np.array(nn_list)


# interestingly, this is sometimes faster without numba
def compute_score_binary(
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the core of the performance improvements for score computation - it expresses the computation as a sequence of numpy matrix operations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant